ABSTRACT

We present an approach, called the Shadow Method, for the identification of disease loci from dense 
genetic marker maps in complex, potentially incomplete pedigrees. Shadow is a simple method based 
on an analysis of the patterns of obligate meiotic recombination events in genotypic data. This method 
can be applied to any high density marker map and was specifically designed to exploit the fact that 
extremely dense marker maps are becoming more readily available. We also describe how to interpret 
and associate meaningful P-Values to the results. Shadow has significant advantages over traditional
parametric linkage analysis methods in that it can be readily applied even in cases in which the topology 
of a pedigree or pedigrees can only be partially determined. In addition, Shadow is robust to variability in a 
range of parameters and in particular does not require prior knowledge of mode of inheritance, penetrance 
or clinical misdiagnosis rate. Shadow can be used for any SNP data, but is especially effective when 
applied to dense samplings. Our primary example uses data from Affymetrix 100k SNPChip samples in 
which we illustrate our approach by analyzing simulated data as well as genome-wide SNP data from 
two pedigrees with inherited forms of kidney failure, one of which is compared with a typical LOD score 
analysis. 

Subject headings: SNP, LOD score, complex pedigree 



1. Introduction 

Studies of genetic disease have been remarkably 
successful in identifying disease genes and novel bio- 
logical pathways. For family-based analyses of pheno- 
types with single, highly penetrant disease alleles, the 
first step is the identification of a locus harboring the 
mutant allele. This requires the acquisition and subse- 
quent analysis of a significant amount of genetic data. 
As regards the former, the ease with which investiga- 
tors can accomplish genome-wide genotyping has in- 
creased tremendously in recent years. For example, 
one commercial microarray technology (Affymetrix 
SNPChip) now allows rapid chip-based genotyping of 
approximately 10,000, 100,000, and 500,000 SNPs
(see Matsuzaki et al.).

Most of the currently available linkage approaches 
were originally developed with the goal of extract- 
ing as much information as possible from a relatively 
small set of markers. We base our approach on the 
fact that with very dense genetic maps, we can ignore 
markers that are not fully informative and still extract 
most of the useful genetic information. In essence, 
our method is based on identifying obligate recombi- 
nation events and using the distribution of these events 
to identify genomic regions inherited identical by de- 
scent (IBD). This allows us to handle the complicated 
requirements of real data and the often complex and 
incompletely known structures of available pedigrees. 
We call our technique the Shadow Method and intro- 
duce it in the next section. 

Our motivation for the development of Shadow is 
severalfold. Perhaps most important is the fact that 
available software is overmatched by the great number 
of computations required in order to calculate paramet- 
ric or non-parametric LOD scores for large pedigrees 
and large data sets. It is known that using standard
methods, the size of the calculation (as measured in
the number of arithmetic operations) increases exponentially
in pedigree size or in the number of markers used
(the various elaborations of the Elston-Stewart algorithm,
as in Ott (6), and the NPL algorithm, as in Kruglyak
et al. (3), respectively). In contrast, the computational
load of Shadow grows only linearly with the number
of markers and at a rate that is less than exponential in
pedigree size. In the worst case it increases
exponentially in the number of samples, but is independent
of pedigree size. This enables us to analyze large pedigrees.

Computational complexity is just one concern. We 
are also cognizant of the fact that in analyses of large
complex pedigrees, it can be extremely useful for investigators
to have an index of which regions are most
likely to harbor disease genes by virtue of being shared
IBD among affected individuals, as well as a
measure, given data from a subset of a pedigree, of the
distance from IBD for any region of the genome. This
relies on the computation of something we call the
Shadow function at the locus x, denoted S(x). It is
effectively a measure of just how inconsistent the data
are with the hypothesis that the pattern of inheritance
at a given locus is IBD. In particular, S(x) = 0
implies IBD at x.

Thus, the Shadow method is a conceptually and 
computationally simple technique with several fea- 
tures that we believe make it useful for the analysis 
of large, complex, and perhaps incomplete pedigrees, 
particularly for relatively rare diseases caused by un- 
common genetic variants of large effect: (1) Shadow 
enables rapid identification of genetic regions most 
likely to harbor IBD regions in pedigrees; (2) Shadow 
measures how inconsistent such regions (and in fact all 
regions) are from being IBD; and (3) Shadow helps to 
identify the source of such inconsistencies in "almost 
IBD" regions. We also develop methods to assess how 
likely we are to find such IBD or "almost IBD" re- 
gions by chance. The specifics of this measure and the 
details of its interpretation are presented in the next 
section. 

We illustrate the use of Shadow by analyzing both 
simulated data as well as genome-wide SNP data from 
two pedigrees with inherited forms of kidney disease. 



In this paper we draw the distinction between the pedigree members 
and the samples, the latter of which are those people in the pedigree 
for whom we possess a genotyped DNA sample. 



The pedigrees are illustrated in Figure 1. The family
FS-Z has a relatively simple pedigree and it is known
that the responsible gene defect is a point mutation
in the TRPC6 gene on chromosome 11q (Reiser et al.).
In this case a full multi-point linkage analysis will
work well and we compare our results to a LOD score
analysis. The second family we analyze, the FG-FM
family, has a large and incomplete pedigree, a situation
that makes standard linkage approaches unreliable
and/or impossible.

2. The Shadow Function - Measuring distance 
from IBD 

2.1. Definition of Shadow function 

At the core of the Shadow method is the idea that
the sample data provide us with a means to measure,
for each locus x, the degree to which the data
are inconsistent with the hypothesis that the region
around x is IBD and thus is possibly within a
disease-harboring region. We call this measure the Shadow
function and denote it by S. Since we focus on inconsistency,
a locus x that is consistent with the IBD
assumption has S(x) = 0, reflecting that it is at distance 0
from being IBD.

To articulate this distance we use the familiar notion
of an inheritance vector, as introduced in Kruglyak et
al. (2). Recall that an inheritance vector v is a vector of
ones and zeros that tells us which copy of a marker is
passed on during each meiosis in our
pedigree. In particular, if we label one of the chromosomes
in each homologous chromosomal pair with a
zero and the other with a one, then we have an inheritance
vector v(x) defined at each locus x. The value of
S approximates the minimal number of changes in the
inheritance vector necessary for v(x) to be ideally consistent
with a disease allele being located at that point
x. Since our examples use only affected samples, this
gives us an estimate of the minimal number of changes
necessary for the inheritance vector at x to be IBD.
In Section 4 we explain how to include controls.

The exact sense of distance is captured by the fol- 
lowing definition: 

Definition: For a given inheritance vector v, let
m(v) denote the minimal number of changes (bit-flips)
necessary to make the vector IBD. We call a partition
of our samples (see footnote 2) consistent with a given inheritance
vector v if the samples in each part of the partition
are IBD from some common founder using v. We let
Part(v) be the set of the partitions consistent with v.
Similarly we denote by Inh(P) the set of inheritance
vectors with which P is consistent. Then we define S
to be

$$ S(x) = \min_{P \in \mathrm{Part}(v(x))} \left( \min_{v \in \mathrm{Inh}(P)} m(v) \right). $$

2 A partition of the set of samples is simply its decomposition into a

For example, for a simple pedigree with autosomal
dominant inheritance and a 0% phenocopy rate,
S(disease locus) = 0.

Figure 2 gives us a first illustration of the function
S. The Shadow in Figure 2 was constructed from a
simulated FS-Z family assumed to have the disease locus
at 1 morgan from the p end of chromosome 11.
That is, we ran twenty simulations of allele segregation
in chromosome 11 consistent with the pedigree
for FS-Z and a disease locus at the TRPC6 locus, and
chose one with two IBD regions for illustrative purposes. Hence
S(Ch 11, 1 morgan) = 0, since this location is fully
consistent with harboring a disease allele. Each time a
crossover occurs in meiosis, there is a change in the inheritance
vector. There are crossovers on both sides of
the disease locus, and as we move from the disease locus
past such a crossover the value of S goes from 0 to
1. In general, the "corners" of the Shadow curve (see
Figure 2) represent crossovers that have had an effect
on what our data look like from the point of
view of our samples. Notice that we have used a convention
in which the distance axis (y-axis) has its minimal
value of 0 at the top and increases as we move
down the axis.

In real data, the Shadow may have S > 0 at the disease
locus x. In such a case, the value of S is easy to
interpret. Namely

S(x) = # of inconsistencies

where an inconsistency may be either an unanticipated
founder or a person who has an indistinguishable phenotype
but not a disease allele (i.e., a phenocopy).

2.1.1. How to use S - the P-Value

The use of S is very similar to the use of the LOD
score function LOD(x). Figure 3 compares the two.
In particular, if we knew S but not the disease locus
(or loci), then we would identify the region(s) in the
genome where S is minimal as likely candidates. The
next step in the evaluation of such regions is to determine
how likely it is that such a scenario is the result
of chance alone. We call the probability of this scenario
being due to chance alone the event's P-Value.
If the P-Value is small, then we can conclude, with
a specific computed level of certainty,
that a disease locus is in this region, and interpret the
value of S at this point as the number of inconsistencies.
As with the LOD score method (or any method),
if this P-Value is large then it will be difficult to distinguish
a disease allele-harboring region from a chance
IBD region, and this will lead to a high false positive
rate. In Section A we review the process of estimating
the P-Value. For example, if we make the definitions

ch_i = Morgan length of chromosome i

and

B = #(Branches in collapsed pedigree)
for a tree we estimate 

22 

In general, the size and complexity of the pedigree will
influence the P-Value and hence the number of inconsistencies
that can exist at a true disease locus in the
given family before this method will give false positives.
Consider the pedigrees in Figure 1. We find
that in the FS-Z pedigree, the presence of any inconsistency
will be fatal. By contrast, in the FG-FM family
a single inconsistency would still yield significant
results. Specifically, a region where S = 1 could still
be regarded as likely to harbor a disease allele, given
the existence of one inconsistency as defined above.

2.2. Approximation of the Shadow function 

In practice we do not have access to the actual
Shadow, but only to an approximation denoted S_M. The
idea behind this approximation is very simple, namely
that we can identify obligate recombination events between
two individuals if they have incompatible alleles
at some marker. With our SNP data, if we see that Person 1
has alleles AA at the same SNP locus where Person 2
has alleles BB, then we have an obligate recombination
event. With a very dense and polymorphic SNP
map, we can be reasonably sure that if a sufficiently
long consecutive stretch of SNPs occurs without such
an obligate recombination event, then these individuals
share (at least) one chromosomal region identical by
descent. We say that such a streak of markers is consistent
with a partition P (of samples) if each part of
the partition contains no obligate recombination events
throughout the streak.
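
The streak test described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the genotype encoding ("AA", "AB", "BB", or None for a discarded call), the function names, and the toy data are all assumptions introduced here. A marker breaks a streak for a group of samples when the group contains both an AA and a BB call.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Call = Optional[str]  # "AA", "AB", "BB", or None for a discarded/no-call genotype


def obligate_recombination(calls: Sequence[Call]) -> bool:
    """True if some pair of samples are opposite homozygotes (AA vs BB)."""
    seen = set(calls)
    return "AA" in seen and "BB" in seen


def consistent_streaks(
    genotypes: Dict[str, List[Call]],
    group: Sequence[str],
    min_len: int,
) -> List[Tuple[int, int]]:
    """Half-open marker intervals [start, end) of length >= min_len in which
    no pair of samples in `group` shows an obligate recombination."""
    n_markers = len(genotypes[group[0]])
    streaks = []
    start = 0
    for i in range(n_markers + 1):
        # The position one past the last marker acts as a sentinel that
        # closes the final streak.
        broken = i == n_markers or obligate_recombination(
            [genotypes[name][i] for name in group]
        )
        if broken:
            if i - start >= min_len:
                streaks.append((start, i))
            start = i + 1
    return streaks


# Toy data (hypothetical): three samples typed at six markers.
toy = {
    "s1": ["AA", "AB", "AA", "BB", "AA", "AB"],
    "s2": ["AA", "AA", "AB", "AA", "AB", "AA"],
    "s3": ["AB", "AA", "AA", "AB", "BB", "AA"],
}
print(consistent_streaks(toy, ["s1", "s2"], min_len=3))
```

In this toy example s1 and s2 are opposite homozygotes at the fourth marker, so only the first three markers form a streak of length 3 for that pair.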

We will view the streak as a non-coincidence if it
exceeds the critical length of M markers (how to choose
M is explored at length in Section 3.2).

Definition: Let Part_M(x) be the set of partitions
with the property that there exists a streak of length at
least M and containing x that is consistent with this
partition. Then we define S_M to be

$$ S_M(x) = \min_{P \in \mathrm{Part}_M(x)} \left( \min_{v \in \mathrm{Inh}(P)} m(v) \right). $$
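
As a rough illustration of this definition, the sketch below enumerates every partition of a small sample set, finds the streaks of at least M markers consistent with each partition, and records the cheapest partition covering each marker. It rests on a simplifying assumption that is not taken from the text: the cost of a partition is taken to be its number of parts minus one (so a single IBD group costs 0 and a two-part partition costs 1), standing in for the inner minimum of m(v) over Inh(P). All names and the toy data are hypothetical.

```python
from typing import Dict, Iterator, List, Optional

Call = Optional[str]  # "AA", "AB", "BB", or None for a no-call


def set_partitions(items: List[str]) -> Iterator[List[List[str]]]:
    """Yield every partition of `items` into non-empty parts."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        # Put `first` into each existing part in turn, or into a new part.
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        yield [[first]] + smaller


def marker_ok(genotypes: Dict[str, List[Call]], part: List[str], i: int) -> bool:
    """No opposite homozygotes (AA vs BB) inside `part` at marker i."""
    seen = {genotypes[name][i] for name in part}
    return not ("AA" in seen and "BB" in seen)


def shadow_approx(
    genotypes: Dict[str, List[Call]], M: int, max_parts: int
) -> List[Optional[int]]:
    """Approximate S_M at every marker; None where no partition with at most
    `max_parts` parts has a consistent streak of length >= M through it."""
    names = sorted(genotypes)
    n = len(genotypes[names[0]])
    best: List[Optional[int]] = [None] * n
    for partition in set_partitions(names):
        if len(partition) > max_parts:
            continue
        cost = len(partition) - 1  # ASSUMPTION: stands in for min m(v)
        start = 0
        for i in range(n + 1):
            ok = i < n and all(marker_ok(genotypes, part, i) for part in partition)
            if not ok:
                if i - start >= M:
                    for x in range(start, i):
                        if best[x] is None or cost < best[x]:
                            best[x] = cost
                start = i + 1
    return best


toy = {
    "s1": ["AA", "AB", "AA", "BB", "AA", "AB"],
    "s2": ["AA", "AA", "AB", "AA", "AB", "AA"],
    "s3": ["AB", "AA", "AA", "AB", "BB", "AA"],
}
print(shadow_approx(toy, M=3, max_parts=2))
```

The enumeration over all partitions is only practical for a handful of samples; restricting to partitions generated by a known pedigree would shrink the search considerably.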

Denser marker maps allow us to obtain better and
better approximations to the true Shadow. Figure 2
shows a sequence of such approximations for simulated
10k, 100k and 500k SNP data for our FS-Z family.
We applied our 10k approximation to the real FS-Z
10k data, and found a unique interval on chromosome 11
where S_58(x) took on its minimal value of 0,
as seen in Figure 3 (the choice of 58 is discussed
in Section 2.3.1). As published in Reiser et al., this is the
location of the TRPC6 gene that harbors the disease-causing
allele. The Shadow curve S_200(x) for the FG-FM
family can be seen in Figure 5. Here we see a
unique S_200(x) = 1 interval on chromosome 22. In
Section A we estimate the P-Value for such a region occurring
somewhere in the genome and find it to be small, so its existence
is statistically significant and it would be our best candidate
for a disease-harboring gene locus. In
Section 3 we explore and sharpen this FG-FM candidate
using the Shadow Method.

2.3. Analysis of the Shadow 

Notice that in the definition of S_M we only consider
partitions which are consistent with a streak of length
greater than or equal to M. We encounter two potential
problems when choosing M. For M too large, we
run the risk of false negatives. We quantify this with
what we call the Q-Value, introduced in Section 2.3.1.
For M too small, we encounter false positives, as discussed
in Section 2.3.2. In Section 3.2 we see that, using
these notions, we can make sensible choices for M.
In Section 3.2 we will also see that as the number of
SNPs gets larger it becomes possible to choose M so that
there is simultaneously a very small chance of a false
positive and a very small chance of a false negative.

2.3.1. False Negatives 

Figure 2 shows that for the 10k and 100k SNP sets
there are regions where the approximation of S exaggerates how far x is
from being in an IBD region, a situation that will lead
to false negatives in our hunt for disease loci. In fact,
in both the 10k and 100k examples we see that the
method entirely missed the small IBD region to the left
of the IBD region harboring the disease-causing allele
at x = 1. We would like to compute the probability
that we miss the true disease-allele harboring region.
We call this probability the Q-Value and find in Section B
that if we define

G = \sum_i ch_i

N = #(SNP markers)

then we have

$$ Q = 1 - \left(1 + \frac{MBG}{N}\right) e^{-\frac{MBG}{N}}. $$

Fixing the Q-Value is a very natural way to choose M.
For example, in Figure 2, for the 100k simulation we
chose M = 103 since this corresponds to Q = 0.05,
and for 500k we chose M = 217 since this corresponds
to Q = 0.01. We chose M = 58 for the
10k data since it corresponds to detecting a region that
is at least as long as the expected length of a disease-causing
region.
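
To make the choice of M from a target Q-Value concrete, here is a small sketch. It reads the Q-Value expression above as Q = 1 - (1 + t) exp(-t) with t = MBG/N, and then searches for the largest M whose Q-Value stays at or below a target. The inputs B = 10 branches and G = 35 Morgans are illustrative assumptions, not values stated in the text.

```python
import math


def q_value(M: int, B: int, G: float, N: int) -> float:
    """Q = 1 - (1 + t) * exp(-t) with t = M * B * G / N."""
    t = M * B * G / N
    return 1.0 - (1.0 + t) * math.exp(-t)


def largest_M_for_q(target_q: float, B: int, G: float, N: int) -> int:
    """Largest streak length M whose Q-Value does not exceed target_q.

    Q is increasing in M, so a linear scan is sufficient.
    """
    M = 0
    while q_value(M + 1, B, G, N) <= target_q:
        M += 1
    return M


# ASSUMED inputs for illustration only: B and G are not given in the text.
B, G = 10, 35.0
for N, target in [(100_000, 0.05), (500_000, 0.01)]:
    M = largest_M_for_q(target, B, G, N)
    print(f"N={N}, target Q={target}: M={M}, Q(M)={q_value(M, B, G, N):.4f}")
```

Because Q depends on M and N only through the ratio M/N, a fivefold denser map allows a fivefold longer streak at the same Q-Value.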

A reasonable question might be: Why did we not
simply choose them all so that Q = 0.01? The problem
is that the 10k and 100k analyses would then
become cluttered with false positives, the subject of
the next section.

2.3.2. False Positives 

Notice in Figure 2 that in the 10k and 100k cases
there are regions where the approximation of S exaggerates how
close we are to being at an IBD region, a situation that
will lead to false positives in our hunt for disease loci.
In Figure 4 we see an example of an S_M = 0 false
positive region, and we can estimate the probability of
such a false positive in the genome as follows. Let

S = #(Samples for which we have SNP data)

p ≈ P(More Likely SNP Allele)

$$ p_j = pq\left( (1 - p^{S-j})(1 - q^{j}) + (1 - q^{S-j})(1 - p^{j}) \right) $$

$$ p_{\max} = \max\{ p_j \mid 1 \le j \le \lfloor S/2 \rfloor \} $$

$$ p_{\min} = \min\{ p_j \mid 1 \le j \le \lfloor S/2 \rfloor \} $$

then we have that this false positive rate FP satisfies

$$ FP < N p_{\max} (1 - p_{\min})^{M}. $$

Notice, the Q-Value improves as M/N decreases. On
the other hand, the larger the choice of M the smaller
the false positive rate. Hence there is a balance between
making M large in order to shrink the noise and
making M/N small in order to shrink the Q-Value. Explicit
examples of this balancing act are given in Section 3.2.
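
The false positive bound can be evaluated directly. The sketch below reads the bound as FP < N p_max (1 - p_min)^M with p_j as defined above, and assumes q = 1 - p (the frequency of the less likely allele), which the text does not state explicitly. The numerical inputs are illustrative only.

```python
def fp_bound(N: int, M: int, S: int, p: float) -> float:
    """Upper bound N * p_max * (1 - p_min)**M on the false positive rate.

    N: number of SNP markers, M: streak length, S: number of samples,
    p: frequency of the more likely allele.
    """
    if S < 2:
        raise ValueError("need at least two samples")
    q = 1.0 - p  # ASSUMPTION: q is the frequency of the less likely allele
    p_j = [
        p * q * ((1 - p ** (S - j)) * (1 - q ** j)
                 + (1 - q ** (S - j)) * (1 - p ** j))
        for j in range(1, S // 2 + 1)
    ]
    return N * max(p_j) * (1 - min(p_j)) ** M


# Illustrative inputs (assumed): 100k markers, 6 samples, p = 0.85.
for M in (50, 100, 200):
    print(M, fp_bound(100_000, M, S=6, p=0.85))
```

A value above 1 simply means the bound carries no information at that M; the bound decays geometrically as M grows, which is the counterweight to the Q-Value growing with M.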




Fig. 1. — Examples of collapsed pedigrees of the families analyzed here. A collapsed pedigree only includes people
for whom there exists a genotyped DNA sample (in green), the non-founders (in blue), and founders that in the ideal
disease associated scenario contributed a disease allele (in red). Red edges are hypothetical and a, b, and c represent
the number of non-founders along the hypothetical edge. For the FG-FM family this is a minimally complicated
pedigree consistent with a unique "red" founder and in our analysis we assume a = c = 1 and b = 2.

Fig. 2. — Here we see a simulation of the genetic process in the FS-Z family on chromosome 11 (see Section A). We
have plotted the simulation's Shadow using a red line. The black curves are approximations of the Shadow using: (A)
10k SNP data, then (B) 100k SNP data. We chose our Shadow by simulating the genetic process 20 times and
picking one with a second chance IBD region for illustrative purposes. That we expect such chance IBD regions is
due in part to the fact that the P-Value for an IBD region is large for the FS-Z pedigree (see Section A).

10K, Expected Length Shadow Approximation

Fig. 3. — The black curve is S_58 for the real FS-Z family data on chromosome 11. In the next Figure, we compare S_58
to LOD(x) in red as computed by the Genehunter2 program as implemented by the dChip program (see Kruglyak
et al. and Leykin). The LOD score values are on the right hand y-axis.

Fig. 4. — Here we see an example of a false positive IBD region in our FS-Z family arising in a simulation.

Fig. 5. — The black curve outlines S_200(x) for the whole genome using the FG-FM family's actual SNP data.

3. The Shadow Method

For both the 10k data on family FS-Z and the 100k 
data on the FG-FM family it is impossible to simul- 
taneously make Q and FP small. By taking a more 
careful look at the data we are still able to reduce the 
Q-Value. 

We use the fact that near a disease locus, S has a
very distinct tiered or "wedding cake" shape. More
precisely, a true S = 0 region sits on top of an S = 1
region, which sits on top of an S = 2 region, and
so on, each layer requiring at least a pair of obligate
crossovers to make the transitions between the tiers.
In Figure 3 we see a very typical example. This structure
allows us to detect an S = 0 level by searching for
a cake with a long S = 1 region as its top layer. This
technique will work best when such candidate regions
are themselves rare, for example in the FG-FM family.
Armed with such candidates we can take a more
detailed look at the definition of S. Namely, we notice
that when approximating the Shadow, at each point in
the genome we obtain a list of partitions of the samples
that are compatible with the data, and these partitions
can be used to provide greater insight into the disease
loci. It is the full use of this information that is called
the Shadow Method. The analysis of these partitions
takes two primary forms that we will now explore.

The first case is as in Figure 7, where we see an
example of a large S = 1 region on chromosome 2
in the FS-Z data that looks a lot like a cake missing
its top layer. We find that the left half is given by the
partition consisting of {113, 114, 115} and its complement,
while the right half is determined by the singleton
{213} and its complement, and in the middle the
two partitions are both consistent. This is indicated
schematically on the right hand side of Figure 7. How
can this happen? The most likely possibility, as illustrated
in Figure 7, is that there is an IBD region separated
by crossovers as indicated. Whenever an S = k
region is comprised of a pair of partitions which differ
by incompatible obligate crossovers and which intersect in
a region compatible with the removal of these obligate
crossovers, we can deduce the likely existence of an
S = k - 1 region.

This method also applies to the FG-FM family,
though in a second, weaker form. Once again we will
explore the possibility of an IBD region in the S_M = 1
candidate region. In this case there is a unique partition
that gives our candidate S_M = 1 interval, and it is
composed of the samples {61, 612cm} and this set's
complement. If we believe that an IBD region might
be present, then we would conclude that the true pedigree
is more likely to look something like the pedigree
in Figure 6, with a relatively large number of
non-founders d + b compared with the number of non-founders
a + c. With d + b relatively large there are
many chances for crossovers near the disease locus and
hence the IBD region may be quite small. To explore
this possibility, we can look at an approximation with
a better Q-Value, like S_50(x) as in Figure 8. Using
S_50(x) we find a candidate IBD region. The assumption
that b + d is relatively large compared with a + c
makes plausible the scenario for the IBD region's existence
as pictured in the right half of Figure 8. Furthermore,
this IBD streak has a length of 60 markers,
and having a streak this long inside of our length-283
S = 1 region by chance is unlikely. Namely, in Section C
we find that the probability of a streak this long
or longer due to chance is less than 0.12. While this
argument is not as convincing as the earlier example
with the FS-Z family (where the S = 1 region was
comprised of two partitions), this still gives us a good
first place to look for a disease allele.

A key aspect of this method that we still need to
discuss is how to choose M. The idea is to choose, if
possible, an M that simultaneously makes the chance
of false positives and false negatives using the full
Shadow Method small. In the next section, we estimate
the false negative rate using the Shadow Method
and in Section 3.2 demonstrate how to use the false
negative rate to choose M.

3.1. False negatives revisited 

Applying the Shadow Method (and not simply
attending to the regions where S = 0) reduces the
false negative rate. We call this improved estimate of
the false negative rate the FN-Value. Notice, Q corresponds
to the false negative rate using just a streak
analysis, while FN corresponds to the false negative
rate using the full Shadow Method. In Section B we
find that



FN = 1 

In Section 3.2 we will quantify the extent to which this
method enhances the use of S via some examples. This
improvement in the false negative rate is the motivation
behind the introduction of the full Shadow Method
(as opposed to performing only a longest streak analysis).

3.2. Estimates

First let us review. The following parameters will
be considered:

M = Streak length lacking obligate recombinations

B = #(Branches in collapsed pedigree)

N = #(SNP markers)

D = #(Samples for which we have SNP data)

p ≈ P(More Likely Allele).

It is possible to assess from these parameters the potential
effectiveness of the relevant Shadow Method.
For example, using M = 200 and N = 100k we find
for our example pedigrees:

These values indicate that with M = 200, the FS-Z
pedigree would be handled very nicely via 100k SNP
marker sets since both FN and FP are small
(though the P-Value for this family is weak and
we would expect to need to carefully try
to list all the IBD regions, see Section A). For the
FG-FM family we see that M = 200 has a small FP
but a rather large FN. If we were to try M = 50 we
have the opposite problem
in which FP is very large and FN is small. Exploring
these values we see that we must make a compromise.
For example, for M = 100 we have:

If we don't wish to compromise we will need to use
a denser mapping. For example, using 500k SNPs and
M = 200 we have:

Hence we see in this case the extra SNPs would really
pay off.

These estimates also give a sense of the future for
SNP technology. It is widely estimated that on average
two genomes differ at 1 in 1000 nucleotides
(i.e., approximately 3 million variants per genome).
Hence, it is quite reasonable that we may find 5000k
reasonably informative SNPs. In this case, if a Shadow-based
approach were applied to a collapsed pedigree with 50
members, of which 10 are affected and sampled, then
using M = 190, both Q and FP would be less than
1/500.
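
The balancing act between Q and FP can be explored numerically. The sketch below scans streak lengths M and keeps those for which both the Q-Value and the FP bound of Section 2.3 fall below a target. It is only a sketch under assumptions that are not taken from the text: B = 50 branches for the 50-member pedigree, G = 35 Morgans, p = 0.85, and q = 1 - p.

```python
import math


def q_value(M, B, G, N):
    """Q = 1 - (1 + t) exp(-t), t = M*B*G/N (false negatives of a streak analysis)."""
    t = M * B * G / N
    return 1.0 - (1.0 + t) * math.exp(-t)


def fp_bound(N, M, S, p):
    """N * p_max * (1 - p_min)**M, with q = 1 - p assumed."""
    q = 1.0 - p
    p_j = [
        p * q * ((1 - p ** (S - j)) * (1 - q ** j)
                 + (1 - q ** (S - j)) * (1 - p ** j))
        for j in range(1, S // 2 + 1)
    ]
    return N * max(p_j) * (1 - min(p_j)) ** M


def feasible_M(N, B, G, S, p, target, M_max=2000):
    """All M in 1..M_max with both Q and the FP bound below `target`."""
    return [
        M for M in range(1, M_max + 1)
        if q_value(M, B, G, N) < target and fp_bound(N, M, S, p) < target
    ]


# ASSUMED scenario: 5000k SNPs, 10 samples, B = 50, G = 35 Morgans, p = 0.85.
dense = feasible_M(5_000_000, 50, 35.0, 10, 0.85, target=0.01)
# ASSUMED scenario: 10k SNPs, 6 samples, B = 10.
sparse = feasible_M(10_000, 10, 35.0, 6, 0.85, target=0.01)
print(len(dense), dense[:1], dense[-1:])
print(sparse)
```

Under these assumed inputs the dense map admits a window of acceptable M values, while the sparse map admits none, which mirrors the need for a compromise described above.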

3.3. Assumptions and Caveats 

Here we discuss the assumptions that underlie our 
analysis. We assume that the markers occur ran- 
domly (with respect to morgan measure) throughout 
the genome and that the rates in the founder popula- 
tion of the more common marker alleles behave as if 
they were randomly distributed among the SNPs. Violations
of these assumptions will make some IBD regions
easier to find and some harder. Moreover, it is
well known that such a random independent distribution
is not going to be accurate at SNP densities at which
linkage disequilibrium is observed (see Altshuler),
and the SNPs in haplotypes contain less information
due to the violations of independence.

Another simplifying assumption we make is that we 
can make a reasonable choice of a collapsed pedigree 
with a common founder. Of course on some scale, 
many ancient founders of all or most of the affected 
samples will exist. However, most such founders are 
too genetically distant to be picked up with our meth- 
ods. It is also much less likely that one of these al- 
ternate distant founders has introduced a disease allele 
into our population, at least for rare diseases caused by 
alleles of strong effect. 

In general, applying the full Shadow
Method will become less necessary as the marker
densities increase and the Q-Value shrinks. However,
our estimation techniques rely on assumptions which
are reasonable for the current SNP densities but may
hamper the exploration of very large pedigrees with
very dense SNPs. For example, the 500k and 5000k
estimates from the previous section assume that the
more common of the two SNP alleles occurs on average
no more than about 85% of the time in the founder
population, which is true in our 10k and 100k samples
but may increase as SNP density increases, hence
increasing FP.

Fig. 6. — On the left we see the low noise S_200(x) on chromosome 22 in our FG-FM family. The partition of the
samples responsible for the S_M = 1 region is {61, 612cm} and this set's complement. On the right we see a version
of the FG-FM pedigree consistent with a disease locus on chromosome 22 as discussed in Section 3.



10K, Expected Length Shadow Approximation 




Fig. 7. — Here we see the next most promising candidate region in FS-Z, the large S = 1 region on chromosome 2. In
Section 3 we conclude the likely existence of an IBD region inside this S = 1 region. In the right half of this
Figure we see plausible positions in the pedigree for the obligate crossovers (indicated with yellow squares) necessary
to form the indicated partitions as determined by the Shadow Method.

Fig. 8. — On the left we see the noisier S_50(x) that exposes our most likely candidate IBD region. As in Figure 7, on
the right we explore the plausible IBD region.





[x-axis: Number of SNPs]



Fig. 9. — "The Disease Paradox": Assuming no a priori knowledge of the location of the disease loci, we have that
the chance of the disease being located in any given crossover interval is proportional to the length of the interval.
Hence the probability density function (pdf) of the length of the interval containing the disease, f_Disease(l), satisfies
f_Disease(l) ∝ l f_chance(l), where f_chance(l) is the pdf of the length of a chance IBD region. The green curve in
this figure is the distribution of the SNP length of a chance IBD region, while the black curve is the distribution of the
SNP length of a region which is IBD because the disease is conditioned to be there. These curves were derived using
the independence model discussed in Section B.

4. Computational Methods

The main purpose of this section is to discuss the
complexity of the algorithm used to compute the
Shadow and perform the Shadow Method. To make
this analysis we use the parameters reviewed in Section 3.2
together with the definitions:

T = # of branches remaining upon removal of the
non-genotyped pedigree members (as in the pedigrees
on the left hand side of Figure 8)

and 

H = the maximal number of inconsistencies that we 
will be considering in the computation of the Shadow. 

For example, for the FS-Z family T = 7, we choose 
H = 3, and D = 6 (recall D is the number of sam- 
ples). 

The algorithm requires knowledge of the confi- 
dence call for a SNP, and one must choose how to 
throw away suspicious measurements. This parame- 
ter is important since the Shadow Method is not robust 
under SNP miscalls. (We used parent/child compar- 
isons to help interpret this error rate and found that 
a cutoff of 0.01 using the Affymetrix confidence call 
works well.) 

The analysis of the Shadow presented in the previous
section made use of an approximate pedigree. This
pedigree should be used only if there is a great deal
of confidence that the disease allele is affecting the
samples via these known relationships. The real power
of the Shadow Method is that it allows us to be more
flexible if we are uncertain about the pedigree or of
the pedigree's role in the spread of the disease.
Any likely pedigree can be used, but the complexity
increases with each possibility. Denote as $Ped_H$
the collection of all partitions with $m(v) \le H - 1$
in the pedigree(s) of interest. Then the complexity is
$O(N\,|Ped_H|)$. If there is just a single pedigree to
investigate then we have the universal bound
$|Ped_H| \le \sum_{k=0}^{H-1} \binom{T}{k}$. For a tree
(see footnote 4) this bound is sharp, and for (the tree)
FS-Z $|Ped_3| = 29$. Notice that if we have a list
of candidate collapsed pedigrees then $Ped_H$ is easy to
construct by simply adding in crossovers to the pedigrees
and recording the resulting partitions. Typically,
given a pedigree in which there is confidence both in
the pedigree structure and clinical data, this algorithm
will work for very large pedigrees and numbers
of samples (certainly T and S both less than 22 will
work).

3 The complexity of an algorithm is the number of arithmetic
operations required.

4 We mean here "tree" in the graph theoretic sense, that is, a graph
without loops.
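The single-pedigree bound can be checked with a few lines of Python. This is only a sketch: it assumes the bound reads $\sum_{k=0}^{H-1} \binom{T}{k}$ (at most H - 1 crossovers placed on T branches), and the function name `tree_partition_bound` is illustrative rather than part of the Shadow software.

```python
from math import comb


def tree_partition_bound(T: int, H: int) -> int:
    """Count the ways to place at most H - 1 crossovers on T branches."""
    return sum(comb(T, k) for k in range(H))


# FS-Z example from the text: T = 7 branches and H = 3.
print(tree_partition_bound(7, 3))  # 1 + 7 + 21 = 29
```

With T = 7 and H = 3 this reproduces the value 29 quoted for FS-Z.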

However, in practice the Shadow Method will be
most useful when pedigree information is missing. If
we are completely open-minded about the pedigree
structure, then $|Ped_H|$ is less than or equal to the
number of partitions of a set of S elements into $H - 1$ or
fewer parts (and this number of parts determines the
$m(v)$). In other words,

$$|Ped_H| \le \sum_{k=1}^{H-1} \left\{ {S \atop k} \right\},$$

where $\left\{ {S \atop k} \right\}$ is a Stirling number of the second kind.

This sum grows exponentially and nearly at the rate
$(H - 1)^D$. For the FG-FM analysis, we performed a
completely open-minded analysis and chose H = 4.
On our machines, we could not exceed
S = 11 and H = 4 with 100k data. One important
difference between this method and other forms of
linkage analysis is that the size of the pedigree does not
affect computational speed. Rather, the number of
samples studied (irrespective of the structure of the
pedigree) determines computational size and speed. This
will allow for the analysis of very large and
complicated pedigrees.
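To get a feel for how quickly the open-minded search space grows, the following Python sketch sums Stirling numbers of the second kind. It assumes the sum runs over k = 1, ..., H - 1 parts, as the sentence about "H - 1 or fewer parts" suggests; the helper names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind: partitions of n items into k non-empty parts."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Either the n-th item joins one of k existing parts, or it forms a new part.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def open_minded_bound(S: int, H: int) -> int:
    """Number of partitions of S samples into at most H - 1 parts (assumed reading)."""
    return sum(stirling2(S, k) for k in range(1, H))


for samples in (6, 8, 11):
    print(samples, open_minded_bound(samples, 4))
```

Under this reading, the count for S = 11 and H = 4 is already in the tens of thousands of partitions, each of which must be tracked along the whole marker map.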

Comments About Controls: This algorithm (and
the Shadow Method itself) can be altered to incorporate
controls. For example: Call a region ideally consistent
with disease (ICD) if the affected samples are
IBD in this region and the unaffected samples are not
related to each other or to the affecteds in this region.
Then we can search for how far we are from an ICD
region using exactly the same techniques that we
designed to search for how far we are from an IBD
region. For example, the partition having a part for each
unaffected and a part for all the affected would now
have S = 0, and we would be estimating the distance
from this situation. It should be said that in this case more
inconsistencies should be expected, since the penetrance
rate for a single allele might be very low (especially
if the disease is recessive). However, at a potentially
recessive allele the algorithm can be modified to break a
streak if an AB is observed, hence isolating the region
around a recessive disease locus. This works under the
assumption that both mutant alleles are the same, which is of
course a very strong assumption. To optimally exploit
sibling and parental controls with such a streak analysis
requires more work and a haplotyping version of
the method, work we hope to describe in a future
paper.

5. Discussion

We have described a simple method for identifying 
disease gene loci in pedigrees using dense genetic data. 
We believe this method has several strengths. In any 
family-based study designed to identify loci harboring 
rare alleles of strong effect, the goal is to identify a 
genetic locus (or loci) harboring alleles cosegregating 
with a phenotype of interest. The Shadow function defined
here gives an intuitive interpretation of dense genetic
data. At each point in the genome, Shadow tells
us how inconsistent that point is with being located in
a genetic region shared by a group of phenotypically
"affected" individuals. These S = 0 regions are similar
to regions where the LOD score reaches its maximum
attainable value. However, in contrast to a LOD
score, the Shadow is not itself a likelihood ratio. Thus,
for a family consisting of a single pair of affected sibs,
S = 0 for half of the genome, and S = 1 for the other
half. In this method, statistical significance is assessed
separately. We assign to each value of S a P-Value
which describes the probability of seeing this value by
chance. We also generate FP and FN values, so that
we can assess the chances of a false positive and false
negative using this method. In turn, these estimates
allow us to make a priori estimates of an appropriate
choice of the key parameter M, the length of a streak
of markers lacking obligate recombination events.

In addition, the Shadow Method helps us identify
the cause of deviations from S = 0 regions. For example,
in a genome-wide analysis, we may find no S = 0
region, but a small number of S = 1 regions. We
can specifically examine the nature of the one
inconsistency in each such region to help us evaluate the
plausibility that a phenotype-causing allele is in fact
present.

This method has limitations. There is certainly no 
practical reason to use Shadow to analyze a pedigree of 
the size of FS-Z where a standard linkage analysis with 
a map of only moderate density will work well. While 
at the currently routinely available SNP map densities 
(such as Affymetrix 10k and 100k SNPChips) the as- 
sumptions we use in our analysis appear reasonable, 
we must hope that the nature and quality of SNPs does 
not change in significant ways as densities increase or 
our estimation method will fail to make good sense and 
will need to be modified. 

As noted, we plan to develop further refinements
of this methodology allowing the incorporation of a
greater fraction of the available genetic information
as well as data from controls and unaffected family
members. However, in its present form, we believe
Shadow will have immediate value in the analysis of
genetic data in complex family studies in which traditional
linkage analysis calculations are problematic.

A. P-Value



Here we explain how to approximate the required P-Values. We carefully justify our computation in the basic
case where P is the probability of an IBD region (S = 0 region) and the approximate pedigree is a tree; we then
explain how to modify the answer for other values of S and more complicated pedigrees. The first observation is that
this P-Value is bounded by the expected number of IBD regions, and it is this quantity that we compute.

For each pair of spousal founders there are four chromosomes which could be responsible for a given IBD region.
We fix one of these four possibilities for the i-th chromosome and call a region of the samples IBD relative to it $IBD_i$.
Recall $ch_i = E(C_i)$ where $C_i$ is the total number of crossovers during a meiosis process on the i-th chromosome. We
have

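$$
\begin{aligned}
E(|IBDs|) &= 4\,E\Big(\sum_{i=1}^{22} \#(IBD_i\text{'s})\Big) \\
&= 4 \sum_{i=1}^{22} E\big(E(\#(IBD_i) \mid C_i = N)\big) \\
&= 4 \sum_{i=1}^{22} E\Big(\frac{N + 1}{2^B}\Big) \\
&= 4 \sum_{i=1}^{22} \frac{B\,ch_i + 1}{2^B}.
\end{aligned}
$$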

Notice this estimate of the P-Value is exponentially decreasing. In particular, if the collapsed pedigree is a tree
then for more than 16 branches the chance of a chance IBD is less than 5 percent, and if the number of branches is
greater than 20 then the P-Value of a chance IBD region is less than 1 in 500.
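The exponential decay can be illustrated numerically. The sketch below assumes the estimate has the form $4 \sum_{i=1}^{22} (B\,ch_i + 1)/2^B$, with B the number of branches of the collapsed tree. The genetic map used here (22 autosomes of equal length totalling 35 Morgans) is a hypothetical placeholder, not the map used in the paper, and the function name is illustrative.

```python
def expected_chance_ibd(B: int, chrom_morgans: list[float]) -> float:
    """Expected number of chance IBD regions for a tree with B branches."""
    return 4 * sum((B * ch + 1) / 2**B for ch in chrom_morgans)


# Assumption: 22 autosomes, about 35 Morgans in total, split evenly.
ASSUMED_MAP = [35.0 / 22] * 22

for branches in (10, 17, 21):
    print(branches, expected_chance_ibd(branches, ASSUMED_MAP))
```

With this placeholder map the estimate falls below 5 percent at 17 branches and below 1 in 500 at 21 branches, in line with the thresholds quoted above.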

For a non-tree the numerator of $\frac{1}{2^B}$ would become the number of collections of crossover events that still leave our
samples IBD. For the FG-FM pedigree, we have one loop and find that the number of collections of crossover events that
still leave our samples IBD equals $2^2 + 2^2 - 1$. So the expected number of IBD regions is less than 25^0 and hence
the P-Value associated to an IBD candidate is bounded by — ■ For a general S = k, we must list all the partitions
that are consistent with k or fewer obligate crossovers and then count all the collections of crossover events that can
result in such partitions. We find that the probability of an S = 1 region in our FG-FM family is less than

A.1. When the P-Value is high

Using the computation in the previous section, in FS-Z we find that we expect 1.4 IBD regions from this pedigree
other than the one due to the disease. This explains why we should not be surprised to find at least two IBD regions
(as we see in Figures 3 and 7). In general, it is important to list all the candidate regions when the P-Value is not
small. For example, under the assumption that we have no information about the location of our disease loci before
the experiment, we have the following theorem:

Key Theorem: Assuming the disease is in D = {x | S = k}, the probability that a given interval in D contains 
the disease marker is proportional to that region's length in D. 

For example, using the Shadow Method we find three good candidate IBD regions in the FS-Z family (the ones on
chromosomes 11 and 2 and another on chromosome 16) with Morgan lengths roughly 0.2, 0.05 and 0.05. So assuming
the disease locus was in an IBD region, this key theorem tells us that we should have assigned an a priori probability of
roughly 2/3 to the disease locus being in the chromosome 11 region (where it turned out to actually be).
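The arithmetic behind this example is a simple normalization of region lengths. The Python sketch below uses the approximate lengths quoted in the text; the assignment of the two 0.05 Morgan regions to chromosomes 2 and 16 and the dictionary keys are illustrative assumptions.

```python
def region_probabilities(lengths: dict[str, float]) -> dict[str, float]:
    """A priori probability of each candidate region, proportional to its length."""
    total = sum(lengths.values())
    return {name: length / total for name, length in lengths.items()}


# Approximate Morgan lengths of the three FS-Z candidate regions.
candidates = {"chr11": 0.2, "chr2": 0.05, "chr16": 0.05}
for name, prob in region_probabilities(candidates).items():
    print(f"{name}: {prob:.3f}")
```

The chromosome 11 region receives 0.2 / 0.3, or roughly 2/3 of the prior mass, and each of the shorter regions receives roughly 1/6.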



B. False Negatives 

Here we approximate the Q and the FN. To compute the Q, we first notice that the end points of the interval
containing a disease locus are dictated by crossovers in the collapsed pedigree. A randomly selected such
interval will have a length distribution corresponding to the distribution of the length of a
chance IBD region, and we denote the probability density function (pdf) of this length as $f_{chance}(l)$. As it turns out,
the disease is more likely to be in a longer interval. This fact sometimes goes by the name of the "Bus Paradox", which
for the purposes (and context) of this paper we will rename the "Disease Paradox". This is illustrated in Figure 9,
where we find the pdf of the interval containing a disease locus is $f_{Disease}(l) \propto l\, f_{chance}(l)$.

To determine $f_{chance}(l)$ requires a choice of model for the meiosis process. We use the standard independence
assumption that underlies the linkage analysis approach as developed in Lander et al. and call this model the
independence model. Here is a brief review. We can view each chromosome as an interval with subintervals each
associated to an inheritance vector, where neighboring inheritance vectors differ by exactly one change to the vector.
We call such an interval a crossover interval when we restrict our attention to the collapsed pedigree (with respect to any
one of the four founding chromosomes). We need to decide how to choose the endpoints of these subintervals. There
is a natural measure on each chromosome which assigns to each interval the expected number of cuts during a meiosis
process, called the Morgan measure. In the case of multiple cuts during meiosis, the positions are not independently
chosen with respect to the Morgan measure (this is due to interference), but if we view the meiosis processes associated
to distinct individuals in our pedigree as independent and note that the expected number of cuts per individual is small,
then a Poisson process should give an excellent approximation when examining even a moderate size pedigree. This
approximation is equivalent to the well studied Markov assumption as utilized in most forms of linkage analysis and
as developed in (Lander et al.) and in (Kruglyak et al.). This model is not directly utilized in the formulation of
our algorithm, but only in order to analyze the results, and it is also how we simulated the genetic process. (We
assumed independent founders and chose the cuts via this Poisson process.)
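The "Disease Paradox" under the Poisson model can be seen in a small simulation. The Python sketch below is not the authors' simulation code: it places cuts on a single chromosome as a Poisson process with a hypothetical rate B = 10 cuts per Morgan, and compares the mean length of a typical crossover interval with the mean length of the interval covering a fixed locus. The chromosome length, rate, seed and function names are all assumptions made for illustration.

```python
import random


def crossover_positions(rate: float, length: float, rng: random.Random) -> list[float]:
    """Poisson process on [0, length]: successive gaps are exponential with the given rate."""
    positions = []
    x = rng.expovariate(rate)
    while x < length:
        positions.append(x)
        x += rng.expovariate(rate)
    return positions


def interval_containing(point: float, cuts: list[float], length: float) -> float:
    """Length of the crossover interval that covers the given point."""
    left = max((c for c in cuts if c <= point), default=0.0)
    right = min((c for c in cuts if c > point), default=length)
    return right - left


def simulate(rate: float = 10.0, length: float = 4.0, trials: int = 20000, seed: int = 0):
    """Return (mean length of a typical interval, mean length of the interval at the locus)."""
    rng = random.Random(seed)
    n_intervals = 0
    covering_total = 0.0
    for _ in range(trials):
        cuts = crossover_positions(rate, length, rng)
        n_intervals += len(cuts) + 1
        covering_total += interval_containing(length / 2, cuts, length)
    return trials * length / n_intervals, covering_total / trials


chance_mean, disease_mean = simulate()
print(chance_mean, disease_mean)
```

A typical interval has mean length close to 1/B, while the interval covering the locus has mean length close to 2/B, which is the length-biased sampling effect described above.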

Under our Poisson assumption $f_{chance}(l) = B e^{-Bl}$, hence $f_{Disease}(l) = B^2 l e^{-Bl}$. To estimate the Q-Value, first
note that M SNPs correspond roughly to m = morgans, and hence under these assumptions we can approximate
the Q-Value via

$$
\begin{aligned}
Q &= B^2 \int_0^m l e^{-Bl}\,dl \\
&= \int_0^{Bm} l e^{-l}\,dl \\
&= 1 - (1 + Bm) e^{-Bm}.
\end{aligned}
$$

To approximate FN we can look one layer down in the tree. To do so, let L denote the length of the region to the
IBD region's left and R the length of the region to its right. We have

$$
\begin{aligned}
FN &= \int_0^m P\big(R < (m - l) \text{ and } L < (m - l)\big)\, f_{Disease}(l)\,dl \\
&= \int_0^m P(R < (m - l))\, P(L < (m - l))\, f_{Disease}(l)\,dl \\
&= \int_0^m \big(1 - e^{-B(m - l)}\big)^2 B^2 l e^{-Bl}\,dl \\
&= 1 - \big(2 + (mB)^2 - e^{-mB}\big) e^{-mB}.
\end{aligned}
$$
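Both closed forms can be checked by numerical integration. The Python sketch below assumes the integrals run from 0 to m, with $f_{Disease}(l) = B^2 l e^{-Bl}$ and $P(R < x) = P(L < x) = 1 - e^{-Bx}$; the values B = 10 and m = 0.3 are arbitrary test inputs, and the function names are illustrative.

```python
import math


def simpson(f, a: float, b: float, n: int = 2000) -> float:
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def q_closed(B: float, m: float) -> float:
    a = B * m
    return 1 - (1 + a) * math.exp(-a)


def fn_closed(B: float, m: float) -> float:
    a = B * m
    return 1 - (2 + a * a - math.exp(-a)) * math.exp(-a)


def q_numeric(B: float, m: float) -> float:
    return simpson(lambda l: B * B * l * math.exp(-B * l), 0.0, m)


def fn_numeric(B: float, m: float) -> float:
    def integrand(l: float) -> float:
        both_short = (1 - math.exp(-B * (m - l))) ** 2  # P(R < m - l) * P(L < m - l)
        return both_short * B * B * l * math.exp(-B * l)
    return simpson(integrand, 0.0, m)


B, m = 10.0, 0.3
print(q_closed(B, m), q_numeric(B, m))
print(fn_closed(B, m), fn_numeric(B, m))
```

The numerical and closed-form values agree to well within the integration error, which supports the reconstruction of the two final expressions.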



C. False Positives 

To explore noise we need to articulate a model of the SNPs themselves. We let $SNP_i$ denote the value of the i-th SNP.
Each $SNP_i$ comes in one of two flavors, A or B. It is perhaps more useful to think of them labeled instead as Less and
More, representing the less and more common alleles. To model the distribution of the SNPs we would like to assign
a value of More with a probability that approximates the rate at which More would occur in a population of founders.
Let us call this probability $P(SNP_i = More)$. Note that if we intend to use a parametric maximum likelihood method
(as in (Kruglyak et al.)), then it would be wise to carefully explore this distribution. However, for our purposes we
feel some simplifying assumptions are reasonable, namely that the population from which founders are drawn is large
enough so that the $SNP_i$ are independent, and that $p = P(SNP_i = More)$ is independent of i. We also used these
assumptions when simulating SNPs. We acknowledge that these are very serious assumptions and will become
falser and falser for denser and denser maps, as discussed in Section 3.3.

As introduced in Section 2.3.2, Noise is comprised of streaks of data accidentally consistent with a given partition,
making $S_M(x) > S(x)$. Hence, we will define a unit of noise to be a streak of length greater than or equal to M where
$S_M(x) > S(x)$ throughout this streak. We let Noise in a region be the total number of such units of noise in that
region.
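The definition of a unit of noise translates directly into a counting routine. The Python sketch below assumes that both the computed M-Shadow and the true Shadow are available as per-marker sequences (as they would be in a simulation) and that a unit is a maximal streak; the function and argument names are hypothetical.

```python
def count_noise_units(shadow_m: list[int], shadow: list[int], M: int) -> int:
    """Count maximal streaks of length >= M on which shadow_m[x] > shadow[x]."""
    units = 0
    run = 0
    for sm, s in zip(shadow_m, shadow):
        if sm > s:
            run += 1
        else:
            if run >= M:
                units += 1
            run = 0
    if run >= M:  # a streak may extend to the end of the region
        units += 1
    return units


true_shadow = [0] * 10
computed = [0, 1, 1, 1, 0, 1, 1, 0, 1, 1]
print(count_noise_units(computed, true_shadow, M=3))
```

In this toy input only the first streak reaches length 3, so it is the only one counted as a unit of noise when M = 3.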

Let us start with an example of estimating noise by examining carefully our S = 1 region
on chromosome 22 in the FG-FM family. Here we have an interval of length L = 283 markers consistent with a
partition of our 8 samples with two parts, one of size 2 and the other of size 6. Under our simplifying assumptions, we
claim that the expected noise in the M = 50 Shadow is given by

$$P(Noise) < L\, p_n (1 - p_n)^M,$$

where $p_n = pq\big((1 - p^6)(1 - q^2) + (1 - q^6)(1 - p^2)\big)$.

Proof: The observation is simple. We assign position i in our length L region a value of 1 if it starts a streak of
length greater than or equal to M and 0 otherwise, and denote this quantity as $Pos_i$. Each $Pos_i$ is a version of the
random variable Pos, which is equal to 1 if M+1 flips of the coin are such that the first result is a tail and the next M
give heads, with a probability of tails equal to $p_n$. Hence $E(Pos) = p_n (1 - p_n)^M$ and

$$E(Noise) = E\Big(\sum_{i=1}^{L-M} Pos_i\Big) = \sum_{i=1}^{L-M} E(Pos_i) < L\,E(Pos) = L\, p_n (1 - p_n)^M.$$

Now we need to estimate the probability of tails. In order to have a tails outcome, we need the chromosomes that
are IBD for each part to satisfy Allele(Part 1) $\ne$ Allele(Part 2), and at least one of the other founder chromosomes in each
part to take on the same allele value as the chromosome that is IBD for this part. Hence

$$p_n = p(1 - p^6)\,q(1 - q^2) + q(1 - q^6)\,p(1 - p^2)$$

as claimed.

QED 

Of course to actually compute it we need an approximation of p. To do so, we first note that the probability that the
alleles are different, P(AB), equals $2p(1 - p)$. We can use our data to approximate P(AB) and solve this quadratic
to find $p \approx 0.84$. (This corresponds to a maximum likelihood estimate of the parameter.)
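The chromosome 22 example can be worked through numerically. The Python sketch below solves $2p(1 - p) = P(AB)$ for the root with $p \ge 1/2$ and then evaluates the bound $L\,p_n (1 - p_n)^M$ for L = 283 and M = 50. It assumes q = 1 - p, which is not stated explicitly in this section, and it back-computes P(AB) from p = 0.84 because the observed heterozygosity is not reported here; the function names are illustrative.

```python
import math


def allele_freq_from_het(p_ab: float) -> float:
    """Solve 2p(1 - p) = p_ab for the root p >= 1/2 (the More allele)."""
    return (1 + math.sqrt(1 - 2 * p_ab)) / 2


def p_noise(p: float, sizes: tuple[int, int] = (6, 2)) -> float:
    """p_n for a two-part partition with the given part sizes, assuming q = 1 - p."""
    q = 1 - p
    a, b = sizes
    return p * q * ((1 - p**a) * (1 - q**b) + (1 - q**a) * (1 - p**b))


def noise_bound(L: int, M: int, p_n: float) -> float:
    """Upper bound L * p_n * (1 - p_n)^M on the expected noise."""
    return L * p_n * (1 - p_n) ** M


p = allele_freq_from_het(2 * 0.84 * 0.16)  # stand-in for the observed P(AB)
p_n = p_noise(p)
print(p, p_n, noise_bound(283, 50, p_n))
```

Under these assumptions the bound on the expected noise in the S = 1 region comes out well below one unit.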

If we use the whole genome as our region then $FP \le E(Noise)$, and it is this relationship we will use to bound
FP. To give a nice bound we use the special case of a tree, though the only role this plays is in ensuring that the chance
of an accidentally consistent set of markers when S < 1 is less than this chance when S = 1. Hence we can use the
S = 1 case with a unique corresponding partition to bound this probability. So we can assume there are two parts in
our partition of our D samples, and hence the probability of success at any given point is bounded above and below by

$$p_{max} = \max\big\{ pq\big((1 - p^{D-j})(1 - q^j) + (1 - q^{D-j})(1 - p^j)\big) \;\big|\; 1 \le j \le \lfloor D/2 \rfloor \big\}$$

$$p_{min} = \min\big\{ pq\big((1 - p^{D-j})(1 - q^j) + (1 - q^{D-j})(1 - p^j)\big) \;\big|\; 1 \le j \le \lfloor D/2 \rfloor \big\}$$

and the same argument as above tells us that

$$FP \le E(\text{WD Noise}) \le N\, p_{max} (1 - p_{min})^M,$$

and hence will give us a sense for the expected noise.

In general, such estimates tell us that if N is big enough we do not need to be very careful in analyzing our data
and the Shadow Method will work well. For example, letting $M = \sqrt{N}$ we can see that as SNP density increases
the percentage of the genome where $S_M$ and S disagree quickly goes to zero as N goes to infinity. However, as
discussed in Section 3.3, as the marker density increases the assumptions that underlie our estimate will become
less and less realistic (especially the independence of the markers), and caution is required.
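The behaviour of the bound $N\,p_{max}(1 - p_{min})^M$ with $M = \sqrt{N}$ can be tabulated directly. The Python sketch below assumes q = 1 - p, takes p = 0.84 and D = 8 samples as illustrative inputs, and rounds $\sqrt{N}$ down to an integer; the function names are hypothetical.

```python
import math


def two_part_prob(p: float, D: int, j: int) -> float:
    """Per-marker probability for a two-part partition with part sizes D - j and j."""
    q = 1 - p
    return p * q * ((1 - p ** (D - j)) * (1 - q**j) + (1 - q ** (D - j)) * (1 - p**j))


def fp_bound(N: int, M: int, p: float, D: int) -> float:
    """Upper bound N * p_max * (1 - p_min)^M on the expected genome-wide noise."""
    probs = [two_part_prob(p, D, j) for j in range(1, D // 2 + 1)]
    return N * max(probs) * (1 - min(probs)) ** M


for N in (10**4, 10**5, 10**6):
    M = math.isqrt(N)  # M = floor(sqrt(N))
    print(N, M, fp_bound(N, M, p=0.84, D=8))
```

Although the prefactor grows linearly in N, the factor $(1 - p_{min})^{\sqrt{N}}$ shrinks much faster, so the bound collapses as the map becomes denser, subject to the independence caveat above.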



D. Key Theorem 

Set Up: Let D be a set, let rp be a process that selects a random point from D, and let RS be a process that selects
a random subset of D.

For Simplicity: Assume D is finite and that $P(x \in RS) \ne 0$ for all x (like the real human genome).
Definition: Let $(RS \mid x \in RS)$ denote the result of the process conditioned to contain x.
Lemma: Upon witnessing $E = (RS \mid rp \in RS)$ we have

$$P\big(rp = x \mid (RS \mid rp \in RS) = E\big) \propto \frac{P(rp = x)}{P(x \in RS)}\, \chi_E(x),$$

where $\chi_E$ is the indicator function on E.
Proof: Recall Bayes Theorem

$$P(A \mid B) = P(B \mid A)\, \frac{P(A)}{P(B)}$$

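and notice

$$P\big((RS \mid x \in RS) = E\big) = P(RS = E \mid x \in RS).$$

From these observations, we have that

$$
\begin{aligned}
P\big(rp = x \mid (RS \mid rp \in RS) = E\big)
&= P\big((RS \mid rp \in RS) = E \mid rp = x\big)\, \frac{P(rp = x)}{P\big((RS \mid rp \in RS) = E\big)} \\
&= P\big((RS \mid x \in RS) = E\big)\, \frac{P(rp = x)}{P\big((RS \mid rp \in RS) = E\big)} \\
&= P(RS = E \mid x \in RS)\, \frac{P(rp = x)}{P\big((RS \mid rp \in RS) = E\big)} \\
&= P(x \in RS \mid RS = E)\, \frac{P(RS = E)}{P(x \in RS)}\, \frac{P(rp = x)}{P\big((RS \mid rp \in RS) = E\big)} \\
&= \chi_E(x) \left( \frac{P(RS = E)}{P\big((RS \mid rp \in RS) = E\big)} \right) \left( \frac{P(rp = x)}{P(x \in RS)} \right)
\end{aligned}
$$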

as asserted. 

QED 

Comment: This lemma captures the intuitive fact that if a point is relatively unlikely to be in RS but turns up in
E, then this point is more likely to be the point upon which RS was conditioned. This could be useful in situations
in which there is a great deal of prior information regarding the disease locus. However, when applying this lemma to
derive the key theorem we assume that the disease's location is a priori totally unknown (so $P(rp = x)$ is independent
of x) and we make the Mendelian assumption ($P(x \in RS)$ is independent of x).
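The lemma can be verified exactly on a small finite example. The Python sketch below uses a toy set D = {0, 1, 2}, a uniform prior for rp, and an arbitrary distribution over three subsets; all of these inputs and the function names are invented for illustration. It compares the posterior obtained by explicitly building the conditioned process with the weights $P(rp = x)\,\chi_E(x) / P(x \in RS)$ given by the lemma.

```python
from fractions import Fraction


def containment(prior, subset_probs):
    """P(x in RS) for every point x."""
    return {x: sum(pr for s, pr in subset_probs.items() if x in s) for x in prior}


def posterior_by_enumeration(prior, subset_probs, E):
    """Draw rp = x from the prior, then RS conditioned to contain x; condition on seeing E."""
    contains = containment(prior, subset_probs)
    joint = {}
    for x in prior:
        cond = subset_probs.get(E, 0) / contains[x] if x in E else 0  # P(RS = E | x in RS)
        joint[x] = prior[x] * cond
    total = sum(joint.values())
    return {x: v / total for x, v in joint.items()}


def posterior_by_lemma(prior, subset_probs, E):
    """Normalize the weights P(rp = x) / P(x in RS) over the points of E."""
    contains = containment(prior, subset_probs)
    weights = {x: (prior[x] / contains[x] if x in E else 0) for x in prior}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


prior = {x: Fraction(1, 3) for x in range(3)}
subset_probs = {
    frozenset({0, 1}): Fraction(1, 2),
    frozenset({1, 2}): Fraction(1, 4),
    frozenset({0, 1, 2}): Fraction(1, 4),
}
E = frozenset({0, 1, 2})
print(posterior_by_enumeration(prior, subset_probs, E))
print(posterior_by_lemma(prior, subset_probs, E))
```

In this toy example point 2 is the least likely to belong to RS, and it receives the largest posterior weight (6/13), as the comment above describes. When both the prior and $P(x \in RS)$ are constant in x, the posterior is uniform on E, which is the situation used for the key theorem.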
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